We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) finetuned performance that is better than or competitive with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
Translated by Google Translate
Mix-up training approaches have proven effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community has expanded mix-up methods in two directions, with extensive efforts to improve saliency-guided procedures but minimal focus on the arbitrary path, leaving the randomization domain unexplored. In this paper, inspired by the strengths that each direction has over the other, we introduce a novel method that lies at the junction of the two routes. By combining the best elements of randomness and saliency utilization, our method balances speed, simplicity, and accuracy. We name our method R-Mix following the concept of "Random Mix-up". We demonstrate its effectiveness in generalization, weakly supervised object localization, calibration, and robustness to adversarial attacks. Finally, to address the question of whether a better decision protocol exists, we train a Reinforcement Learning agent that decides the mix-up policies based on the classifier's performance, reducing dependency on human-designed objectives and hyperparameter tuning. Extensive experiments further show that the agent is capable of performing at the cutting-edge level, laying the foundation for a fully automatic mix-up. Our code is released at [https://github.com/minhlong94/Random-Mixup].
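For readers unfamiliar with the family of methods the abstract builds on, the following is a minimal sketch of the basic input-level mix-up operation, assuming flat feature lists and one-hot labels. It is not R-Mix itself: the saliency-guided and random mask selection described above is omitted, and the function name `mixup` and its parameters are hypothetical.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=1.0, rng=random):
    """Blend two samples and their one-hot labels with one shared weight.

    The weight lam is drawn from Beta(alpha, alpha), as in basic mix-up.
    Assumption: inputs are equal-length lists of floats.
    """
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the inputs.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    # The same convex combination is applied to the labels.
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when comparing mixing policies.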
Our long-term goal is to use image-based depth completion to quickly create 3D models from sparse point clouds, e.g., from SfM or SLAM. Much progress has been made in depth completion. However, most current works assume well-distributed samples of known depth, e.g., from Lidar or random uniform sampling, and perform poorly on uneven samples, such as those from keypoints, because of the large unsampled regions. To address this problem, we extend CSPN with multiscale prediction and a dilated kernel, leading to much better completion of keypoint-sampled depth. We also show that a model trained on NYUv2 creates surprisingly good point clouds on ETH3D by completing sparse SfM points.
Multilayer perceptrons (MLPs) learn high frequencies slowly. Recent approaches encode features in spatial bins to speed up the learning of details, but at the cost of larger model size and loss of continuity. Instead, we propose to encode features in bins of Fourier features, which are commonly used for positional encoding. We call these Quantized Fourier Features (QFF). Because QFF is a naturally multiresolution and periodic representation, our experiments show that using it can result in smaller model size, faster training, and better quality outputs for several applications, including Neural Image Representations (NIR), Neural Radiance Field (NeRF) and Signed Distance Function (SDF) modeling. QFF are easy to code, fast to compute, and serve as a simple drop-in addition to many neural field representations.
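As background for the Fourier features mentioned above, here is a minimal sketch of the common sinusoidal positional encoding of a scalar coordinate, assuming frequencies that double at each level. It illustrates only the multiresolution, periodic features that QFF bins over; the quantization and per-bin learned features of QFF are not shown, and the name `fourier_features` is hypothetical.

```python
import math


def fourier_features(x, num_freqs=4):
    """Return [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..num_freqs-1.

    Assumption: x is a scalar coordinate; each level doubles the frequency,
    which is what makes the encoding multiresolution.
    """
    feats = []
    for k in range(num_freqs):
        w = (2.0 ** k) * math.pi
        feats.append(math.sin(w * x))
        feats.append(math.cos(w * x))
    return feats
```

Every feature lies in [-1, 1] and the lowest frequency has period 2 in `x`, so the whole encoding repeats with that period.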
We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image inpainting based foreground generation methods often struggle to generate diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression for face inpainting results). We leverage contrastive learning with latent codes to generate diverse foreground results for the same masked input. Specifically, we define two sets of latent codes, where one controls a pre-defined factor ("known"), and the other controls the remaining factors ("unknown"). The sampled latent codes from the two sets jointly bi-modulate the convolution kernels to guide the generator to synthesize diverse results. Experiments demonstrate the superiority of our method over state-of-the-art methods in result diversity and generation controllability.
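We propose opPINN, a hybrid framework that combines a physics-informed neural network (PINN) with operator learning to approximate the solution of the Fokker-Planck-Landau (FPL) equation. The opPINN framework is divided into two steps: Step 1 and Step 2. After the operator surrogate models are trained in Step 1, the PINN can use the pre-trained surrogate models to approximate the solution of the FPL equation efficiently in Step 2. The operator surrogate models greatly reduce the computational cost and boost the PINN by approximating the complex Landau collision integral in the FPL equation. The operator surrogate models can also be combined with traditional numerical schemes, where they provide high efficiency in computation time as the number of velocity modes grows. Using the opPINN framework, we provide neural network solutions of the FPL equation under various types of initial conditions and interaction models in two and three dimensions. Furthermore, based on the theoretical properties of the FPL equation, we show that the approximated neural network solution converges to the a priori classical solution of the FPL equation as the pre-defined loss function decreases.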
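Recent state-of-the-art methods in semi-supervised learning (SSL) combine consistency regularization with confidence-based pseudo-labeling. To obtain high-quality pseudo-labels, a high confidence threshold is typically adopted. However, it has been shown that the softmax-based confidence scores of deep networks can be very high for samples far from the training data, so pseudo-labels may remain unreliable even for high-confidence unlabeled samples. In this work, we present a new perspective on pseudo-labeling: instead of relying on model confidence, we measure whether an unlabeled sample is likely to be "in-distribution", i.e., close to the current training data. To classify whether an unlabeled sample is "in-distribution" or "out-of-distribution", we adopt the energy score from the out-of-distribution detection literature. As training progresses, more unlabeled samples become in-distribution and contribute to training, and the labeled and pseudo-labeled data together can better approximate the true distribution to improve the model. Experiments show that our energy-based pseudo-labeling method, although conceptually simple, significantly outperforms confidence-based methods on imbalanced SSL benchmarks and achieves competitive performance on class-balanced data. For example, it yields a 4-6% absolute accuracy improvement on CIFAR10-LT when the imbalance ratio is higher than 50. Further improvements are obtained when it is combined with state-of-the-art long-tailed SSL methods.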
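The selection rule described above can be sketched as follows, assuming the usual definition of the energy score from the out-of-distribution detection literature, E(x) = -T * log(sum_k exp(z_k / T)) over the logits z, where lower energy suggests an in-distribution sample. The function names and the threshold value are hypothetical, and this is an illustration rather than the authors' implementation.

```python
import math


def energy_score(logits, temperature=1.0):
    """Energy of a sample: -T * logsumexp(logits / T), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    # Subtracting the maximum avoids overflow in exp().
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def pseudo_label(logits, threshold):
    """Return the argmax class if the energy is below threshold, else None.

    Assumption: threshold is a tuned hyperparameter; a sample with energy
    at or above it is treated as out-of-distribution and skipped.
    """
    if energy_score(logits) < threshold:
        return max(range(len(logits)), key=lambda i: logits[i])
    return None
```

Unlike a softmax confidence, the energy depends on the overall magnitude of the logits, so a sample with uniformly small logits is rejected even when one class is slightly preferred.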
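Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, evaluating the transferability of these models remains challenging because of the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components. (i) Datasets. As a downstream evaluation suite, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyperparameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample efficiency (zero-shot and few-shot) and parameter efficiency (linear probing and full model fine-tuning). We publicly release ELEVATER at https://computer-vision-in-the-wild.github.io/elevater/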
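Masked autoencoding has achieved great success for self-supervised learning in the image and language domains. However, mask-based pretraining has yet to show benefits for point cloud understanding, likely because standard backbones such as PointNet cannot properly handle the mismatch between training and testing distributions that masking introduces during training. In this paper, we bridge this gap by proposing a discriminative mask pretraining Transformer framework, MaskPoint, for point clouds. Our key idea is to represent the point cloud as discrete occupancy values (1 if part of the point cloud; 0 if not) and to perform simple binary classification between masked object points and sampled noise points as the proxy task. In this way, our approach is robust to the point sampling variance in point clouds and facilitates learning rich representations. We evaluate our pretrained models on several downstream tasks, including 3D shape classification, segmentation, and real-world object detection, and demonstrate state-of-the-art results while achieving a significant pretraining speedup (e.g., 4.1x on ScanNet) compared to the prior state-of-the-art Transformer baseline. Code is available at https://github.com/haotian-liu/maskpoint.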
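The proxy task described above can be illustrated with a small data-preparation sketch, assuming points are (x, y, z) tuples and that noise points are drawn uniformly from the bounding box of the cloud. The function name, the mask ratio, and the sampling scheme are hypothetical choices for illustration, and the Transformer that consumes these queries is not shown.

```python
import random


def make_occupancy_queries(points, mask_ratio=0.5, rng=None):
    """Split a point cloud into visible points and labeled occupancy queries.

    Masked real points get label 1 (occupied); an equal number of noise
    points sampled in the bounding box get label 0 (assumed unoccupied).
    """
    rng = rng or random.Random()
    idx = list(range(len(points)))
    rng.shuffle(idx)
    n_masked = max(1, int(len(points) * mask_ratio))
    masked = [points[i] for i in idx[:n_masked]]
    visible = [points[i] for i in idx[n_masked:]]
    # Axis-aligned bounding box of the full cloud.
    lows = [min(p[d] for p in points) for d in range(3)]
    highs = [max(p[d] for p in points) for d in range(3)]
    noise = [
        tuple(rng.uniform(lows[d], highs[d]) for d in range(3))
        for _ in masked
    ]
    queries = [(p, 1) for p in masked] + [(p, 0) for p in noise]
    return visible, queries
```

A noise point may occasionally land on the real surface; this sketch ignores that case, which a practical sampler would need to handle.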
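DIVeR builds on the key ideas of NeRF and its variants (density models and volume rendering) to learn 3D object models that can be rendered realistically from a small number of images. In contrast to all previous NeRF methods, DIVeR uses deterministic rather than stochastic estimates of the volume rendering integral. DIVeR's representation is a voxel-based field of features. To compute the volume rendering integral, a ray is broken into intervals, one per voxel; the components of the volume rendering integral are estimated from the features of each interval using an MLP, and the components are then aggregated. As a result, DIVeR can render thin translucent structures that other integrators miss. Furthermore, DIVeR's representation has semantics that are relatively exposed compared to other such methods: moving feature vectors around in the voxel space results in natural edits. Extensive qualitative and quantitative comparisons with current state-of-the-art methods show that DIVeR produces models that (1) render at or above state-of-the-art quality, (2) are very small without being baked, (3) render very fast without being baked, and (4) can be edited in natural ways.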
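The aggregation step mentioned above can be sketched with standard front-to-back alpha compositing, assuming each ray interval has already been reduced to an opacity and a color. In DIVeR those per-interval values come from an MLP applied to voxel features; here they are plain inputs, and the function name `composite` is hypothetical.

```python
def composite(intervals):
    """Aggregate per-interval (alpha, rgb) pairs ordered front to back.

    Each interval contributes transmittance * alpha * rgb, where the
    transmittance is the product of (1 - alpha) over earlier intervals.
    Returns the accumulated color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for alpha, rgb in intervals:
        weight = transmittance * alpha
        for c in range(3):
            color[c] += weight * rgb[c]
        # Light that passes this interval reaches the next one.
        transmittance *= 1.0 - alpha
    return tuple(color), transmittance
```

Because the loop visits one interval per voxel in a fixed order, the estimate is deterministic for a given ray, in contrast to schemes that sample positions along the ray at random.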